```python
import pandas as pd
import numpy as np
```
kakamana
January 23, 2023
We are going to learn a few different techniques for selecting the most important features from your dataset: how to eliminate redundant features, how to use text vectors to reduce the number of features, and how to use principal component analysis (PCA) to reduce the dimensionality of your dataset.
This Selecting Features for Modeling article is part of the DataCamp course Preprocessing for Machine Learning in Python.
This is my learning experience of data science through DataCamp.
Take an exploratory look at the post-feature-engineering `hiking` dataset.
| | Prop_ID | Name | Location | Park_Name | Length | Difficulty | Other_Details | Accessible | Limited_Access | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B057 | Salt Marsh Nature Trail | Enter behind the Salt Marsh Nature Center, loc... | Marine Park | 0.8 miles | None | <p>The first half of this mile-long trail foll... | Y | N | NaN | NaN |
| 1 | B073 | Lullwater | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 1.0 mile | Easy | Explore the Lullwater to see how nature thrive... | N | N | NaN | NaN |
| 2 | B073 | Midwood | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.75 miles | Easy | Step back in time with a walk through Brooklyn... | N | N | NaN | NaN |
| 3 | B073 | Peninsula | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.5 miles | Easy | Discover how the Peninsula has changed over th... | N | N | NaN | NaN |
| 4 | B073 | Waterfall | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.5 miles | Easy | Trace the source of the Lake on the Waterfall ... | N | N | NaN | NaN |
Now let’s identify the redundant columns in the `volunteer` dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.
For example, if you explore the `volunteer` dataset in the console, you’ll see three features which are related to location: `locality`, `region`, and `postalcode`. They contain repeated information, so it would make sense to keep only one of the features.
There are also features that have gone through the feature engineering process: columns like `Education` and `Emergency Preparedness` are a product of encoding the categorical variable `category_desc`, so `category_desc` itself is redundant now.
Take a moment to examine the features of the `volunteer` dataset in the console, and try to identify the redundant features.
| | vol_requests | title | hits | category_desc | locality | region | postalcode | created_date | vol_requests_lognorm | created_month | Education | Emergency Preparedness | Environment | Health | Helping Neighbors in Need | Strengthening Communities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Web designer | 22 | Strengthening Communities | 5 22nd St\nNew York, NY 10010\n(40.74053152272... | NY | 10010.0 | 2011-01-14 | 0.693147 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 20 | Urban Adventures - Ice Skating at Lasker Rink | 62 | Strengthening Communities | NaN | NY | 10026.0 | 2011-01-19 | 2.995732 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 500 | Fight global hunger and support women farmers ... | 14 | Strengthening Communities | NaN | NY | 2114.0 | 2011-01-21 | 6.214608 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 15 | Stop 'N' Swap | 31 | Environment | NaN | NY | 10455.0 | 2011-01-28 | 2.708050 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 15 | Queens Stop 'N' Swap | 135 | Environment | NaN | NY | 11372.0 | 2011-01-28 | 2.708050 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
Index(['vol_requests', 'title', 'hits', 'category_desc', 'locality', 'region',
'postalcode', 'created_date', 'vol_requests_lognorm', 'created_month',
'Education', 'Emergency Preparedness', 'Environment', 'Health',
'Helping Neighbors in Need', 'Strengthening Communities'],
dtype='object')
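The column listing above is the `volunteer` DataFrame’s `.columns` index. The reduced DataFrame shown below is the result of dropping the redundant columns; a minimal sketch of that selection step (the `volunteer_subset` name is my own, and the drop list simply mirrors the redundancies discussed above):

```python
# Create a list of redundant column names to drop
to_drop = ['vol_requests', 'category_desc', 'locality', 'region', 'created_date']

# Drop those columns from the dataset to keep only the relevant features
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Look at the head of the reduced dataset
print(volunteer_subset.head())
```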
| | title | hits | postalcode | vol_requests_lognorm | created_month | Education | Emergency Preparedness | Environment | Health | Helping Neighbors in Need | Strengthening Communities |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Web designer | 22 | 10010.0 | 0.693147 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | Urban Adventures - Ice Skating at Lasker Rink | 62 | 10026.0 | 2.995732 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | Fight global hunger and support women farmers ... | 14 | 2114.0 | 6.214608 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | Stop 'N' Swap | 31 | 10455.0 | 2.708050 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | Queens Stop 'N' Swap | 135 | 11372.0 | 2.708050 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
Let’s take a look at the `wine` dataset again, which is made up of continuous, numerical features. Run Pearson’s correlation coefficient on the dataset to determine which columns are good candidates for elimination. Then, remove those columns from the DataFrame.
| | Flavanoids | Total phenols | Malic acid | OD280/OD315 of diluted wines | Hue |
|---|---|---|---|---|---|
| 0 | 3.06 | 2.80 | 1.71 | 3.92 | 1.04 |
| 1 | 2.76 | 2.65 | 1.78 | 3.40 | 1.05 |
| 2 | 3.24 | 2.80 | 2.36 | 3.17 | 1.03 |
| 3 | 3.49 | 3.85 | 1.95 | 3.45 | 0.86 |
| 4 | 2.69 | 2.80 | 2.59 | 2.93 | 1.04 |
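The correlation matrix below would be produced by calling `.corr()` on this five-column subset; a sketch, assuming the subset is stored in a DataFrame I’ll call `wine_subset` (the name is a guess):

```python
# Look at the pairwise Pearson correlations of the wine subset
wine_subset.corr()
```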
| | Flavanoids | Total phenols | Malic acid | OD280/OD315 of diluted wines | Hue |
|---|---|---|---|---|---|
| Flavanoids | 1.000000 | 0.864564 | -0.411007 | 0.787194 | 0.543479 |
| Total phenols | 0.864564 | 1.000000 | -0.335167 | 0.699949 | 0.433681 |
| Malic acid | -0.411007 | -0.335167 | 1.000000 | -0.368710 | -0.561296 |
| OD280/OD315 of diluted wines | 0.787194 | 0.699949 | -0.368710 | 1.000000 | 0.565468 |
| Hue | 0.543479 | 0.433681 | -0.561296 | 0.565468 | 1.000000 |
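From the matrix above, `Flavanoids` is strongly correlated with both `Total phenols` (0.86) and `OD280/OD315 of diluted wines` (0.79), so it is the natural candidate to eliminate. A sketch of the removal step, again under the hypothetical `wine_subset` name:

```python
# Flavanoids is highly correlated with Total phenols and OD280/OD315,
# so drop it from the subset
to_drop = ['Flavanoids']
wine_subset = wine_subset.drop(to_drop, axis=1)
```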
Let’s expand on the text vector exploration method we just learned about, using the `volunteer` dataset’s title tf/idf vectors. In this first part of text vector exploration, we’re going to add to that function we learned about in the slides. We’ll return a list of numbers with the function. In the next exercise, we’ll write another function to collect the top words across all documents, extract them, and then use that list to filter down our `text_tfidf` vector.
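The snippets below rely on a few objects created earlier in the course: a fitted `TfidfVectorizer` called `tfidf_vec`, the tf/idf matrix `text_tfidf` built from the `title` column, and a reversed vocabulary lookup called `vocab`. A minimal sketch of that setup, under those naming assumptions, would be:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the title column into tf/idf features
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(volunteer['title'])

# Reverse the vocabulary mapping so column indices can be looked up as words
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}
```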
```python
# Pull out the top_n weighted words for a single document in the tf/idf vector
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))

    # Transform that zipped dict into a series keyed by the actual words
    zipped_series = pd.Series({vocab[i]: zipped[i] for i in vector[vector_index].indices})

    # Sort the series to pull out the top_n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index

    # Return the vocabulary indices of those top words
    return [original_vocab[i] for i in zipped_index]

# Print out the word indices of the top weighted words in document 8
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, vector_index=8, top_n=3))
```
[189, 942, 466]
Using the function we wrote in the previous exercise, we’re going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.
```python
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
        # Here we'll call the function from the previous exercise,
        # and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)

    # Return the list as a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the set of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, top_n=3)

# By converting filtered_words back to a list,
# we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]
```
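As a quick sanity check (illustrative only, since the exact dimensions depend on the fitted vocabulary), you can compare the shapes before and after filtering:

```python
# The filtered vector keeps the same rows but far fewer word columns
print(text_tfidf.shape)
print(filtered_text.shape)
```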
Let’s re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the `volunteer` dataset’s `title` and `category_desc` columns.
```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
y = volunteer['category_desc']

# Split the dataset according to the class distribution of category_desc,
# using the filtered_text vector
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))
```
0.5032258064516129
You can see that our accuracy score wasn’t that different from the score at the end of chapter 3. That’s okay; the `title` field is a very small text field, appropriate for demonstrating how filtering vectors works.
Let’s apply PCA to the `wine` dataset, to see if we can get an increase in our model’s accuracy.
| | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
```python
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop('Type', axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)
```
[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
8.25392788e-08]
Now that we have run PCA on the `wine` dataset, let’s try training a model with it.
```python
from sklearn.neighbors import KNeighborsClassifier

y = wine['Type']
knn = KNeighborsClassifier()

# Split the transformed X and the y labels into training and test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
print(knn.score(X_wine_test, y_wine_test))
```
0.6888888888888889